The objective of this project is to predict movie revenue based on historical data about movie revenues and performance at the global box office.
Such a prediction would be useful for optimization in many areas during the planning and production of movies, for instance selection of actors, crew, and locations, production spend, marketing spend, and logistics. Predicting revenue potential would enable movie production companies to make wise investment decisions, produce movies with plots relevant to society and higher entertainment value, and ultimately benefit all involved parties.
In this public competition hosted by Kaggle, I am presented with metadata on past films from The Movie Database to try and predict their overall worldwide box office revenue. Data points provided include cast, crew, plot keywords, budget, posters, release dates, languages, production companies, and countries.
The data is downloaded from https://www.kaggle.com/c/tmdb-box-office-prediction/data
Since the dataset is small, I made a local copy and checked it into git along with the project.
!conda env list
# Import the usual modules to get going: basic visualization, dates, JSON, progress bar, ...
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
from tqdm import tqdm
from datetime import datetime
import time
import json
# Import required sklearn modules
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import KFold, train_test_split
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import mean_squared_log_error
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import GridSearchCV
# Import main algorithm modules
import lightgbm as lgb
import xgboost as xgb
from catboost import CatBoostRegressor
# Following modules are for visualizations
from wordcloud import WordCloud
import eli5
import shap
shap.initjs()
warnings.filterwarnings("ignore")
%matplotlib inline
Review data structure, get a feel for data types and null values
train = pd.read_csv('data/train.csv')
train.info()
There are 3000 records in the dataset. Looking at the data types and the null values in a few of the columns, this will be an interesting data exploration and analysis challenge.
Review the test dataset as well...
test = pd.read_csv('data/test.csv')
test.info()
Let us jump into some exploratory data analysis and take a look at a few records to get a feel for the actual data.
We have JSON key-value pairs, free-form text, categorical variables, continuous variables, and a release date. A dream come true for an aspiring ML engineer to play with this kind of data!
train.head()
Descriptive stats in train and test datasets.
train.describe(include='all')
test.describe(include='all')
Count of missing values in each column, in both train and test datasets. This gives an early indication of which columns need handling for missing values and which strategies to use.
train.isna().sum()
test.isna().sum()
Using pandas_profiling to accelerate data exploration. For details on this module, see https://pandas-profiling.github.io/pandas-profiling/docs/
And a good article discussing the advantages of such a package for data science work: https://towardsdatascience.com/a-better-eda-with-pandas-profiling-e842a00e1136
import pandas_profiling
train.profile_report(style={'full_width':True})
Joint plots to start visualizing relationships between numeric variables...
Budget vs Revenue
sns.jointplot(x="budget", y="revenue", data=train, height=10, ratio=5, color="b")
plt.show()
Popularity vs Revenue
sns.jointplot(x="popularity", y="revenue", data=train, height=10, ratio=5, color="b")
plt.show()
Runtime vs Revenue
sns.jointplot(x="runtime", y="revenue", data=train, height=10, ratio=5, color="g")
plt.show()
Revenue Distribution
The majority of movies are near the zero line, with 75% of them under 68 million. So we need to apply log scaling to make the distribution conducive to analysis and modeling.
sns.distplot(train.revenue)
train.revenue.describe()
Apply the log1p function so that zero revenues do not break the log transform.
train['logRevenue'] = np.log1p(train['revenue'])
sns.distplot(train['logRevenue'])
train['release_date'].head()
Since only the last two digits of the year are provided, let us manipulate the date to recover the correct year.
# Split release date into individual parts; the input is just a string separated by /
train[['release_month','release_day','release_year']]=train['release_date'].str.split('/',expand=True).replace(np.nan, -1).astype(int)
train['release_year'].max()
# This may look like odd logic, but it is the best we can do with the data: 2019 is the pivot year.
# Two-digit years 00-19 map to the 2000s, 20-99 to the 1900s.
train.loc[ (train['release_year'] <= 19) & (train['release_year'] >= 0), "release_year"] += 2000
train.loc[ (train['release_year'] > 19) & (train['release_year'] < 100), "release_year"] += 1900
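The same correction can be sanity-checked in isolation; a minimal pure-Python sketch of the rule (the `fix_two_digit_year` helper is illustrative, not part of the notebook):

```python
def fix_two_digit_year(yy):
    """Map a two-digit year to a full year: 00-19 -> 2000s, 20-99 -> 1900s."""
    if 0 <= yy <= 19:
        return yy + 2000
    if 20 <= yy < 100:
        return yy + 1900
    return yy  # already a four-digit year (or a missing-value sentinel)

# The pivot only works because the data ends in 2019.
assert fix_two_digit_year(15) == 2015
assert fix_two_digit_year(77) == 1977
assert fix_two_digit_year(1994) == 1994
```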
Additional Features from Release Date
# Make new features from date, day of week and quarter...
releaseDate = pd.to_datetime(train['release_date'])
train['release_dayofweek'] = releaseDate.dt.dayofweek
train['release_quarter'] = releaseDate.dt.quarter
Plot of release year counts - shows a significant uptick in movie releases from the 1980s onwards, which makes sense as society becomes more affluent, with noticeable dips due to economic conditions.
plt.figure(figsize=(20,10))
sns.countplot(train['release_year'].sort_values())
plt.title("Movie Release counts by year")
loc, labels = plt.xticks()
plt.xticks(rotation=90)
plt.show()
Plot of release month counts - shows month-wise variation in release counts; an interesting observation is that late summer and early fall have the maximum number of releases.
plt.figure(figsize=(20,10))
sns.countplot(train['release_month'].sort_values())
plt.title("Movie Release counts by Month")
loc, labels = plt.xticks()
loc, labels = loc, ["Jan", "Feb", "Mar", "Apr", "May", "Jun", "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]
plt.xticks(loc, labels)
plt.show()
Plot of release day counts - mostly stable; the first and the middle of the month have many releases. Does this match twice-a-month pay patterns for salaried moviegoers? Maybe, but that would be a tangent for the problem at hand.
plt.figure(figsize=(20,10))
sns.countplot(train['release_day'].sort_values())
plt.title("Release Day Count")
plt.xticks()
plt.show()
Plot of release day of week - matches intuition: the majority of movies are released on Fridays, followed by Thursdays, as moviegoers are eager to spend some of their weekend time on entertainment.
plt.figure(figsize=(20,10))
sns.countplot(train['release_dayofweek'].sort_values())
plt.title("Movies released on day of week")
loc, labels = plt.xticks()
loc, labels = loc, ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]
plt.xticks(loc, labels)
plt.show()
Plot of release quarter counts - no surprises here; matching the monthly trends, summer and fall have more releases than the other two quarters.
plt.figure(figsize=(20,10))
sns.countplot(train['release_quarter'].sort_values())
plt.title("Movies released in quarter")
plt.show()
Plot of release year vs. revenue - matches intuition and the upward trend in release counts from the 80s, with valleys due to economic fluctuations.
# Using pandas groupby to get mean revenue by release year.
# (Plot the grouped series directly; assigning it as a train column would misalign on the row index.)
meanRevenueByYear = train.groupby("release_year")["revenue"].mean()
meanRevenueByYear.plot(figsize=(20,10), color="r")
plt.xticks(np.arange(1920,2018,4))
plt.xlabel("Release Year")
plt.ylabel("Revenue")
plt.title("Movie Mean Revenue By Year")
plt.show()
Release month vs. revenue - an interesting observation: mean revenue peaks in June and hits bottom in September (when the most movies are released). Supply and demand?!
# Using pandas groupby to get mean revenue by release month.
meanRevenueByMonth = train.groupby("release_month")["revenue"].mean()
meanRevenueByMonth.plot(figsize=(20,10), color="r")
plt.xlabel("Release Month")
plt.ylabel("Revenue")
plt.title("Movie Mean Revenue Release Month")
plt.show()
Release day of week vs. revenue - the graph indicates that Wednesday releases have peak mean revenue, even though the most movies are released on Friday, followed by Thursday. So release day of week is less predictive than the release-count plot would suggest.
# Using pandas groupby to get mean revenue by release day of the week.
meanRevenueByDayOfWeek = train.groupby("release_dayofweek")["revenue"].mean()
meanRevenueByDayOfWeek.plot(figsize=(20,10), color="r")
plt.xlabel("Day of Week")
plt.ylabel("Revenue")
plt.title("Movie Mean Revenue by Day of Week")
plt.show()
Movie mean runtime by year - visualizing average runtime, we can see that movies were very long in the early period, averaging 150 minutes. They hit a minimum during the Great Recession and have since bounced around just under the two-hour mark. It is interesting to note that the average peaked just after World War II and again after the Vietnam War. Exploring those aspects would be a tangent; runtime is more or less stable around two hours, so let us move on...
# Using pandas groupby to get mean runtime by release year.
meanRuntimeByYear = train.groupby("release_year")["runtime"].mean()
meanRuntimeByYear.plot(figsize=(20,10), color="r")
plt.xticks(np.arange(1920,2018,4))
plt.xlabel("Release Year")
plt.ylabel("Runtime")
plt.title("Movie Mean Runtime by Year")
plt.show()
Mean popularity by year - a general uptick in movie popularity, no surprises, except for a huge spike in 2016.
# Using pandas groupby to get mean popularity by release year.
meanPopularityByYear = train.groupby("release_year")["popularity"].mean()
meanPopularityByYear.plot(figsize=(20,15), color="r")
plt.xticks(np.arange(1920,2018,4))
plt.xlabel("Release Year")
plt.ylabel("Popularity")
plt.title("Movie Mean Popularity by Year")
plt.show()
Movie mean budget by year - a general uptick in the average budget spent on movies, matching the increased number of movies made over the years and broader economic trends.
# Using pandas groupby to get mean budget by release year.
meanBudgetByYear = train.groupby("release_year")["budget"].mean()
meanBudgetByYear.plot(figsize=(20,10), color="r")
plt.xticks(np.arange(1920,2018,4))
plt.xlabel("Release Year")
plt.ylabel("Budget")
plt.title("Movie Mean Budget by Year")
plt.show()
Count of genres in the training dataset - no surprises here; moviegoers go to movies to experience drama, and the data shows at least half of the movies are in the "Drama" genre, followed by roughly a third in "Comedy" and a quarter each in "Thriller" and "Action".
# Simple helper that safely evaluates a string into a dict/list; used to build the set of unique genres
import ast
def get_dictionary(s):
    try:
        d = ast.literal_eval(s)
    except Exception:
        d = {}
    return d
# Expand out JSON string with Genres, one hot encode and sum on Genres columns.
train['genres'] = train['genres'].map(lambda x: sorted([d['name'] for d in get_dictionary(x)])).map(lambda x: ','.join(map(str, x)))
genres = train.genres.str.get_dummies(sep=',')
train = pd.concat([train, genres], axis=1, sort=False)
for col in genres:
print(col, "Genres Movie : ", genres[col].sum())
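The `str.get_dummies` one-hot encoding step above can be illustrated on a toy series (made-up genre strings, not competition data):

```python
import pandas as pd

toy = pd.Series(["Action,Drama", "Drama", "Comedy"])
dummies = toy.str.get_dummies(sep=',')
# One indicator column per genre name, in sorted order
print(list(dummies.columns))   # ['Action', 'Comedy', 'Drama']
print(dummies['Drama'].sum())  # Drama appears in two rows -> 2
```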
Count of genres in the test dataset - a quick check for any oddities in the genres column; it matches the overall trends, no issues here.
# Expand out JSON string with Genres, one hot encode and sum on Genres columns.
test['genres'] = test['genres'].map(lambda x: sorted([d['name'] for d in get_dictionary(x)])).map(lambda x: ','.join(map(str, x)))
genres = test.genres.str.get_dummies(sep=',')
test = pd.concat([test, genres], axis=1, sort=False)
for col in genres:
print(col, "Genres Movie : ", genres[col].sum())
Original language counts - most movies (> 80%) are in English, probably due to the popularity of the language and of Hollywood. It would be risky to generalize this analysis and model to worldwide predictions, especially as movies in other languages are becoming popular and catching up quickly.
plt.figure(figsize=(20,10))
sns.countplot(train['original_language'].sort_values())
plt.title("Original Language Counts")
plt.show()
Status analysis - a quick peek at the counts of movie "status".
train['status'].value_counts()
We have revenue for 4 "Rumored" movies - how is that possible for movies not yet released? Something to consider during feature engineering when handling these exceptions.
train.loc[train['status'] == "Rumored"][['status','revenue']]
How about the test dataset? 4389 movies released, 7 yet to be released.
test['status'].value_counts()
Homepage analysis - how many movies have a homepage? Could be used as a binary feature.
train['has_homepage'] = 1
train.loc[pd.isnull(train['homepage']) ,"has_homepage"] = 0
plt.figure(figsize=(20,10))
sns.countplot(train['has_homepage'].sort_values())
plt.title("Has Homepage?")
plt.show()
Correlation between has_homepage and revenue - from the visualization we can infer that movies with a homepage generally have higher revenues; they are more popular.
sns.catplot(x="has_homepage", y="revenue", data=train)
plt.title("Revenue of movies with/without homepage")
Does the movie have a tagline, and how does that affect revenue? The analysis indicates movies with a tagline are more popular and command higher revenues.
train['isTaglineNA'] = 0
train.loc[pd.isnull(train['tagline']) ,"isTaglineNA"] = 1
sns.catplot(x="isTaglineNA", y="revenue", data=train)
plt.title('Revenue of movies with/without tagline');
Is the movie title different from the original title? Analyzing the effect of matching vs. differing titles on revenue with a categorical plot, we can conclude that movies sharing a title command more revenue, probably the effect of a previous title's success carrying over to a new release with the same title.
train['isTitleDifferent'] = 1
train.loc[ train['original_title'] == train['title'] ,"isTitleDifferent"] = 0
sns.catplot(x="isTitleDifferent", y="revenue", data=train)
plt.title('Revenue of movies with single and multiple titles');
What is the effect on revenue when the original language is English? This agrees with the original-language analysis: most of our training dataset is English movies.
train['isOriginalLanguageEng'] = 0
train.loc[train['original_language'] == "en" ,"isOriginalLanguageEng"] = 1
sns.catplot(x="isOriginalLanguageEng", y="revenue", data=train)
plt.title('Revenue of movies when Original Language is English and Not English');
We also have an additional dataset with rating and total votes for each movie. It would be interesting to augment the given dataset with these features; it also gives me an opportunity to practice joins, handle records missing after joins, and decide how to impute missing values. Keep building skills!
trainAdditionalFeatures = pd.read_csv('data/TrainAdditionalFeatures.csv')
testAdditionalFeatures = pd.read_csv('data/TestAdditionalFeatures.csv')
train = pd.merge(train, trainAdditionalFeatures, how='left', on=['imdb_id'])
test = pd.merge(test, testAdditionalFeatures, how='left', on=['imdb_id'])
trainAdditionalFeatures.head()
Missing value analysis in additional datasets
train['rating'].isna().sum()
train['totalVotes'].isna().sum()
test['rating'].isna().sum()
test['totalVotes'].isna().sum()
Quite a few records are missing the additional features. So let us fill them with reasonable defaults: 5 votes and a rating of 1.5, cautiously defaulting to low values.
train['rating'] = train['rating'].fillna(1.5)
train['totalVotes'] = train['totalVotes'].fillna(5)
test['rating'] = test['rating'].fillna(1.5)
test['totalVotes'] = test['totalVotes'].fillna(5)
Training set rating visualization - apart from the default used for missing values, the data follows a roughly normal distribution, great!
plt.figure(figsize=(20,10))
sns.countplot(train['rating'].sort_values())
plt.title("Training dataset Rating Count")
plt.show()
Test dataset rating visualization - roughly follows the training dataset trends, so no concerns about using this as a feature.
plt.figure(figsize=(20,10))
sns.countplot(test['rating'].sort_values())
plt.title("Test dataset Rating Count")
plt.show()
Let us explore the effect of rating on mean revenue in the training dataset. A rating of 6 has peak revenue, followed by 7 and 8 - this should be a good feature to predict on!
# Using pandas groupby to get mean revenue by rating
meanRevenueByRating = train.groupby("rating")["revenue"].mean()
meanRevenueByRating.plot(figsize=(20,10), color="r")
plt.xlabel("Rating")
plt.ylabel("Revenue")
plt.title("Movie Mean Revenue By Rating")
plt.show()
Mean revenue vs. total votes - movies with votes in the range of 900 to 1600 tend to command good revenues.
# Using pandas groupby to get mean revenue by total votes
meanRevenueByTotalVotes = train.groupby("totalVotes")["revenue"].mean()
meanRevenueByTotalVotes.plot(figsize=(20,10), color="r")
plt.xticks(np.arange(0,3000,1000))
plt.xlabel("Total Votes")
plt.ylabel("Revenue")
plt.title("Movie Mean Revenue By Total Votes")
plt.show()
Trends of total votes over release year follow the overall uptick in movie releases, giving confidence that we can use this as a predictor feature.
# Using pandas groupby to get mean total votes by release year
meanTotalVotesByYear = train.groupby("release_year")["totalVotes"].mean()
meanTotalVotesByYear.plot(figsize=(20,10), color="r")
plt.xticks(np.arange(1920,2018,4))
plt.xlabel("Release Year")
plt.ylabel("TotalVotes")
plt.title("Movie Mean Total Votes By Release Year")
plt.show()
Total Votes vs Rating Visualization to analyze how they could potentially be interrelated.
# Using pandas groupby to get mean total votes by rating
meanTotalVotesByRating = train.groupby("rating")["totalVotes"].mean()
meanTotalVotesByRating.plot(figsize=(20,10), color="r")
plt.xlabel("Rating")
plt.ylabel("Total Votes")
plt.title("Movie Mean Total Votes by Rating")
plt.show()
# Visualize a word cloud for original title, to get a feel for which words are important...
plt.figure(figsize = (12, 12))
text = ' '.join(train['original_title'].values)
wordcloud = WordCloud(max_font_size=None, background_color='white', width=1200, height=1000).generate(text)
plt.imshow(wordcloud)
plt.title('Top words in titles')
plt.axis("off")
plt.show()
# Visualize a word cloud for overview, to get a feel for which words are important...
plt.figure(figsize = (12, 12))
text = ' '.join(train['overview'].fillna('').values)
wordcloud = WordCloud(max_font_size=None, background_color='white', width=1200, height=1000).generate(text)
plt.imshow(wordcloud)
plt.title('Top words in overview')
plt.axis("off")
plt.show()
# Visualize a word cloud for tagline, to get a feel for which words are important...
plt.figure(figsize = (12, 12))
text = ' '.join(train['tagline'].fillna('').values)
wordcloud = WordCloud(max_font_size=None, background_color='white', width=1200, height=1000).generate(text)
plt.imshow(wordcloud)
plt.title('Top words in tagline')
plt.axis("off")
plt.show()
Let us draw a heat map to visualize correlations.
As expected, total votes and revenue show high correlation at 0.77, as highly voted movies perform well at the box office, followed by budget and revenue at 0.75, which also makes sense: movies that collect large revenues typically have huge budgets. Popularity takes the next spot at 0.46 correlation to revenue, and runtime sits at 0.22. It is surprising that rating is only at 0.17. Release year, release day of week, and release month show weak to very weak negative correlations.
# Heatmap visualization to better understand cross-correlations; so far we have looked at two variables at a time.
train_viz = train[['budget','rating','totalVotes','popularity','runtime','release_year','release_month','release_dayofweek','revenue']]
f,ax = plt.subplots(figsize=(20, 10))
sns.heatmap(train_viz.corr(), annot=True)
plt.show()
With a thorough understanding of the data, including the additional features, let us tackle feature engineering. We could do this the modern ML engineering way, breaking it into modules, building tests, etc.; however, I chose to do it within the notebook to stay focused on finishing the project on time rather than getting sidetracked with beautification and production-grade hardening of the feature engineering pipeline.
Putting together the various data exploration activities and transformation snippets, handling JSON, and modularizing where possible, I wrote a data_prep function.
# Helper method to safely evaluate the argument into a dict/list
import ast
def get_dictionary(s):
    try:
        d = ast.literal_eval(s)
    except Exception:
        d = {}
    return d
# Helper method to get dict with counts for keys in a given list of JSON columns
def get_json_dict(df) :
global json_cols
result = dict()
for e_col in json_cols :
d = dict()
rows = df[e_col].values
for row in rows :
if row is None : continue
for i in row :
if i['name'] not in d :
d[i['name']] = 0
d[i['name']] += 1
result[e_col] = d
return result
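A toy check of the counting logic in `get_json_dict` - the demo inlines the same loop so it is self-contained, and `df_demo` is made-up data:

```python
import pandas as pd

json_cols = ['genres']
df_demo = pd.DataFrame({'genres': [
    [{'name': 'Drama'}],
    [{'name': 'Drama'}, {'name': 'Action'}],
    None,                              # missing rows are skipped
]})

# Same loop as get_json_dict, shown for one column
result = {}
for e_col in json_cols:
    d = {}
    for row in df_demo[e_col].values:
        if row is None:
            continue
        for i in row:
            d[i['name']] = d.get(i['name'], 0) + 1
    result[e_col] = d

print(result)  # {'genres': {'Drama': 2, 'Action': 1}}
```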
# The main data preparation/feature engineering method
def data_prep(df):
    # release_date handling: split into parts and fix two-digit years.
    df[['release_month','release_day','release_year']] = df['release_date'].str.split('/', expand=True).replace(np.nan, 0).astype(int)
    # Two-digit years: 00-19 map to the 2000s (2019 is the pivot year), 20-99 to the 1900s.
    df.loc[(df['release_year'] <= 19) & (df['release_year'] < 100), "release_year"] += 2000
    df.loc[(df['release_year'] > 19) & (df['release_year'] < 100), "release_year"] += 1900
    releaseDate = pd.to_datetime(df['release_date'])
    df['release_dayofweek'] = releaseDate.dt.dayofweek
    df['release_quarter'] = releaseDate.dt.quarter
    # Missing value handling for rating and totalVotes (they come from the additional features dataset).
    # Impute with the mean over movies sharing the same release year and original language.
    df['rating'] = df['rating'].fillna(
        df.groupby(["release_year", "original_language"])['rating'].transform('mean'))
    df['totalVotes'] = df['totalVotes'].fillna(
        df.groupby(["release_year", "original_language"])['totalVotes'].transform('mean'))
    # Default any remaining NaNs/zeros to 1.5 for rating and 5 for totalVotes; these fall outside
    # the valid range but won't break the algorithms the way NaNs do.
    df['rating'] = df['rating'].fillna(1.5)
    df.loc[df['rating'] == 0, "rating"] = 1.5
    df['totalVotes'] = df['totalVotes'].fillna(5)
    df.loc[df['totalVotes'] == 0, "totalVotes"] = 5
    # Ranges of rating and totalVotes differ by orders of magnitude; compute an IMDB-style weighted rating.
    df['weightedRating'] = (df['rating'] * df['totalVotes'] + 6.367 * 1000) / (df['totalVotes'] + 1000)
    # Budget: adjust for inflation (simplistic 2.1% per year) and take the log to reduce skewness.
    df['originalBudget'] = df['budget']
    df['inflationBudget'] = df['budget'] + df['budget'] * 2.1 / 100 * (2018 - df['release_year'])
    df['budget'] = np.log1p(df['inflationBudget'])
    # Crew gender counts
    df['genders_0_crew'] = df['crew'].apply(lambda x: sum([1 for i in x if i['gender'] == 0]))
    df['genders_1_crew'] = df['crew'].apply(lambda x: sum([1 for i in x if i['gender'] == 1]))
    df['genders_2_crew'] = df['crew'].apply(lambda x: sum([1 for i in x if i['gender'] == 2]))
    # A small number of movies have a runtime of 0... impute with the mean runtime.
    df.loc[df['runtime'] == 0, "runtime"] = df['runtime'].mean()
    # Handle collections: label-encode collection names into a numerical feature.
    df['_collection_name'] = df['belongs_to_collection'].apply(lambda x: x[0]['name'] if x != {} else 0)
    le = LabelEncoder()
    le.fit(list(df['_collection_name'].fillna('').astype(str)))
    df['_collection_name'] = le.transform(df['_collection_name'].fillna('').astype(str))
    # Counts of keywords and cast
    df['_num_Keywords'] = df['Keywords'].apply(lambda x: len(x) if x != {} else 0)
    df['_num_cast'] = df['cast'].apply(lambda x: len(x) if x != {} else 0)
    # Popularity relative to the mean popularity of the release year
    df['_popularity_mean_year'] = df['popularity'] / df.groupby("release_year")["popularity"].transform('mean')
    # A few ratios of budget to runtime, popularity, and release year
    df['_budget_runtime_ratio'] = df['budget'] / df['runtime']
    df['_budget_popularity_ratio'] = df['budget'] / df['popularity']
    df['_budget_year_ratio'] = df['budget'] / df['release_year']
    # Release year to popularity, and the inverse ratio
    df['_releaseYear_popularity_ratio'] = df['release_year'] / df['popularity']
    df['_releaseYear_popularity_ratio2'] = df['popularity'] / df['release_year']
    # More ratios involving popularity, total votes, ratings, and release year
    df['_popularity_totalVotes_ratio'] = df['totalVotes'] / df['popularity']
    df['_rating_popularity_ratio'] = df['rating'] / df['popularity']
    df['_rating_totalVotes_ratio'] = df['totalVotes'] / df['rating']
    df['_totalVotes_releaseYear_ratio'] = df['totalVotes'] / df['release_year']
    # Even more ratios: budget to rating, runtime to rating, budget to total votes
    df['_budget_rating_ratio'] = df['budget'] / df['rating']
    df['_runtime_rating_ratio'] = df['runtime'] / df['rating']
    df['_budget_totalVotes_ratio'] = df['budget'] / df['totalVotes']
    # Homepage: flag for missing values
    df['has_homepage'] = 1
    df.loc[pd.isnull(df['homepage']), "has_homepage"] = 0
    # belongs_to_collection: flag for missing values
    df['isbelongs_to_collectionNA'] = 0
    df.loc[pd.isnull(df['belongs_to_collection']), "isbelongs_to_collectionNA"] = 1
    # Tagline: flag for missing values
    df['isTaglineNA'] = 0
    df.loc[pd.isnull(df['tagline']), "isTaglineNA"] = 1
    # Flag for English original language
    df['isOriginalLanguageEng'] = 0
    df.loc[df['original_language'] == "en", "isOriginalLanguageEng"] = 1
    # Flag for a title change between the original and current title
    df['isTitleDifferent'] = 1
    df.loc[df['original_title'] == df['title'], "isTitleDifferent"] = 0
    # Flag for movie release status
    df['isMovieReleased'] = 1
    df.loc[df['status'] != "Released", "isMovieReleased"] = 0
    # Extract collection id from belongs_to_collection
    df['collection_id'] = df['belongs_to_collection'].apply(lambda x: np.nan if len(x) == 0 else x[0]['id'])
    # Letter/word counts for the original title
    df['original_title_letter_count'] = df['original_title'].str.len()
    df['original_title_word_count'] = df['original_title'].str.split().str.len()
    # Word counts for title, overview, and tagline
    df['title_word_count'] = df['title'].str.split().str.len()
    df['overview_word_count'] = df['overview'].str.split().str.len()
    df['tagline_word_count'] = df['tagline'].str.split().str.len()
    # Counts of production countries and companies, cast and crew
    df['production_countries_count'] = df['production_countries'].apply(lambda x: len(x))
    df['production_companies_count'] = df['production_companies'].apply(lambda x: len(x))
    df['cast_count'] = df['cast'].apply(lambda x: len(x))
    df['crew_count'] = df['crew'].apply(lambda x: len(x))
    # Per-row group aggregates by release year and rating (transform keeps row alignment)
    df['meanruntimeByYear'] = df.groupby("release_year")["runtime"].transform('mean')
    df['meanPopularityByYear'] = df.groupby("release_year")["popularity"].transform('mean')
    df['meanBudgetByYear'] = df.groupby("release_year")["budget"].transform('mean')
    df['meantotalVotesByYear'] = df.groupby("release_year")["totalVotes"].transform('mean')
    df['meanTotalVotesByRating'] = df.groupby("rating")["totalVotes"].transform('mean')
    df['medianBudgetByYear'] = df.groupby("release_year")["budget"].transform('median')
    # Expand the JSON columns and one-hot encode names kept in the pruned vocabulary
    for col in ['genres', 'production_countries', 'spoken_languages', 'production_companies']:
        df[col] = df[col].map(lambda x: sorted(list(set([n if n in train_dict[col] else col + '_etc' for n in [d['name'] for d in x]])))).map(lambda x: ','.join(map(str, x)))
        temp = df[col].str.get_dummies(sep=',')
        df = pd.concat([df, temp], axis=1, sort=False)
    # Eliminate the low-frequency bucket for genres
    df.drop(['genres_etc'], axis=1, inplace=True, errors='ignore')
    # Drop the source columns; the engineered features are ready for modeling now!
    df = df.drop(['id', 'revenue', 'belongs_to_collection', 'genres', 'homepage', 'imdb_id', 'overview', 'runtime',
                  'poster_path', 'production_companies', 'production_countries', 'release_date', 'spoken_languages',
                  'status', 'title', 'Keywords', 'cast', 'crew', 'original_language', 'original_title', 'tagline', 'collection_id'],
                 axis=1)
    # Replace any remaining NaNs with 0 (there should not be any!)
    df.fillna(value=0.0, inplace=True)
    return df
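The `weightedRating` feature above is an IMDB-style Bayesian average. Pulled out as a standalone function (same hardcoded constants: a prior mean rating of 6.367 and a prior weight of 1000 votes), its behavior is easy to see:

```python
def weighted_rating(rating, votes, C=6.367, m=1000):
    # Bayesian average: blend the movie's rating with the prior C,
    # weighted by its vote count against the prior weight m.
    return (rating * votes + C * m) / (votes + m)

# A film with few votes is pulled strongly toward the prior...
print(round(weighted_rating(9.0, 10), 3))      # 6.393
# ...while a heavily voted film keeps close to its raw rating.
print(round(weighted_rating(9.0, 100000), 3))  # 8.974
```

This damps the noisy ratings of obscure movies without discarding them.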
# start of actual work, read both test and train datasets
train = pd.read_csv('data/train.csv')
test = pd.read_csv('data/test.csv')
# Set a dummy target value in the test dataset; revenue is missing there.
test['revenue'] = np.nan
# Read the additional feature datasets, merge with train and test datasets on imdb_id
train = pd.merge(train, pd.read_csv('data/TrainAdditionalFeatures.csv'), how='left', on=['imdb_id'])
test = pd.merge(test, pd.read_csv('data/TestAdditionalFeatures.csv'), how='left', on=['imdb_id'])
Quick sanity check on columns read and shape of dataframes
print(train.columns)
print(train.shape)
print(test.columns)
print(test.shape)
Handling JSON expansions, dropping low frequency values to optimize feature matrix.
# Apply log1p transformation to the target, as it is heavily right-skewed.
train['revenue']= np.log1p(train['revenue'])
y = train['revenue'].values
# JSON specific processing, makes use of helper functions to build keys, value counts
json_cols = ['genres', 'belongs_to_collection', 'production_companies', 'production_countries', 'spoken_languages', 'Keywords', 'cast', 'crew']
for col in tqdm(json_cols) :
train[col] = train[col].apply(lambda x : get_dictionary(x))
test[col] = test[col].apply(lambda x : get_dictionary(x))
train_dict = get_json_dict(train)
test_dict = get_json_dict(test)
# remove categories with bias and low frequency
for col in json_cols :
remove = []
train_id = set(list(train_dict[col].keys()))
test_id = set(list(test_dict[col].keys()))
remove += list(train_id - test_id) + list(test_id - train_id)
for i in train_id.union(test_id) - set(remove) :
if train_dict[col][i] < 10 or i == '' :
remove += [i]
for i in remove :
if i in train_dict[col] :
del train_dict[col][i]
if i in test_dict[col] :
del test_dict[col][i]
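The pruning above keeps only names that occur in both the train and test vocabularies with at least 10 training occurrences; on made-up counts the rule looks like this:

```python
train_counts = {'Drama': 50, 'Noir': 3, 'OnlyTrain': 20}
test_counts = {'Drama': 40, 'Noir': 5, 'OnlyTest': 7}

# Drop names seen in only one split, then drop low-frequency (< 10 in train) or empty names
shared = set(train_counts) & set(test_counts)
keep = {k for k in shared if train_counts[k] >= 10 and k != ''}
print(keep)  # {'Drama'}
```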
Merge both datasets, apply common feature engineering processing.
# Merge test and train datasets, perform feature engineering...
all_data = data_prep(pd.concat([train, test]).reset_index(drop = True))
# Split the feature-engineered data back into train and test datasets
n_train = train.shape[0]
train = all_data.loc[:n_train - 1, :]
test = all_data.loc[n_train:, :]
I decided to use a simple DecisionTreeRegressor with default arguments as the benchmark model. Other models' performance will be compared against this model's on the chosen metric, minimizing the RMSLE score for this regression problem.
Let us split the data into train and validation sets with an 80/20 ratio.
# A simple train/validation split, 80% training and 20% validation, for the benchmark model.
x_train, x_valid, y_train, y_valid = train_test_split(train,
y,
test_size=0.2,
random_state = 7)
# Show the results of the split
print("Training set has {} samples.".format(x_train.shape[0]))
print("validation set has {} samples.".format(x_valid.shape[0]))
For the benchmark model, RMSE (on the log-transformed target) is 3.19, which serves as the baseline against which further models' prediction performance is compared.
# Fit regression model (benchmark model)
regr = DecisionTreeRegressor()
regr.fit(x_train, y_train)
# Predict
y_val_pred = regr.predict(x_valid)
# RMSLE on the validation set (returned in LightGBM's eval-tuple format)
def rmsle(y_true, y_pred):
    return 'RMSLE', np.sqrt(np.mean(np.power(np.log1p(y_pred) - np.log1p(y_true), 2))), False
print(rmsle(y_valid, y_val_pred))
print("Root Mean Squared Log Error for Benchmark Model: {}".format(np.sqrt(mean_squared_log_error(y_valid, y_val_pred))))
print("Root Mean Squared Error for Benchmark Model: {}".format(np.sqrt(mean_squared_error(y_valid, y_val_pred))))
# with default values:
# (boosting_type='gbdt', num_leaves=31, max_depth=-1, learning_rate=0.1, n_estimators=100,
# subsample_for_bin=200000, objective=None, class_weight=None,
# min_split_gain=0.0, min_child_weight=0.001, min_child_samples=20,
# subsample=1.0, subsample_freq=0, colsample_bytree=1.0, reg_alpha=0.0, reg_lambda=0.0,
# random_state=None, n_jobs=-1, silent=True, importance_type='split', **kwargs)
core_params = {
'objective': 'regression', # regression, multiclass, binary
'metric': 'rmse' # binary_logloss, mse, mae
}
modelilgb = lgb.LGBMRegressor(**core_params)
modelilgb.fit(x_train, y_train,
eval_set=[(x_train, y_train), (x_valid, y_valid)], eval_metric='rmse',
verbose=1000, early_stopping_rounds=5)
# Train XGBoost model with default parameters
# (params, dtrain, num_boost_round=10, evals=(), obj=None,
# feval=None, maximize=False, early_stopping_rounds=None, evals_result=None,
# verbose_eval=True, xgb_model=None, callbacks=None, learning_rates=None)
params = {'eta': 0.3,
'objective': 'reg:linear',
'max_depth': 6,
'subsample': 1,
'colsample_bytree': 1,
'eval_metric': 'rmse',
'seed': 0,
'silent': True}
train_data = xgb.DMatrix(data=x_train, label=y_train)
valid_data = xgb.DMatrix(data=x_valid, label=y_valid)
watchlist = [(train_data, 'train'), (valid_data, 'valid_data')]
modelixgb = xgb.train(dtrain=train_data, num_boost_round=10, evals=watchlist, early_stopping_rounds=20, verbose_eval=500, params=params)
cat_params = {'learning_rate': 1,
'depth': 2,
'allow_writing_files': False}
modelicat = CatBoostRegressor(iterations=5, **cat_params)
modelicat.fit(x_train, y_train, eval_set=(x_valid, y_valid), cat_features=[], use_best_model=True, verbose=True)
Tweaked the parameters a few times to get an optimal RMSE, and visualized the results with the eli5 and shap modules.
# Simple LightGBM regressor to get a feel for training on the 80/20 train/validation split
params = {'num_leaves': 30,
'min_data_in_leaf': 20,
'objective': 'regression',
'max_depth': 5,
'learning_rate': 0.01,
"boosting": "gbdt",
"feature_fraction": 0.9,
"bagging_freq": 1,
"bagging_fraction": 0.9,
"bagging_seed": 11,
"metric": 'rmse',
"lambda_l1": 0.2,
"verbosity": -1}
model1 = lgb.LGBMRegressor(**params, n_estimators = 20000, nthread = 4, n_jobs = -1)
model1.fit(x_train, y_train,
eval_set=[(x_train, y_train), (x_valid, y_valid)], eval_metric='rmse',
verbose=1000, early_stopping_rounds=200)
eli5.show_weights(model1, feature_filter=lambda x: x != '<BIAS>')
# Using shap to visualize feature importances
explainer = shap.TreeExplainer(model1, x_train)
shap_values = explainer.shap_values(x_train)
shap.summary_plot(shap_values, x_train)
# Some more visualizatons with shap
top_columns = x_train.columns[np.argsort(shap_values.std(0))[::-1]][:10]
for col in top_columns:
    shap.dependence_plot(col, shap_values, x_train)
For the actual training, I intend to use XGBoost, LightGBM, and CatBoost models with K-fold cross-validation over 6 splits (k = 6), and will also look at a blended error score across the three models.
Setting up 6-fold validation with shuffling
# setting up for K-fold cross validation, with 6 folds.
random_seed = 200
n_fold = 6
folds = KFold(n_splits=n_fold, shuffle=True, random_state= random_seed)
# Helper function to train various models
def train_model(X, X_test, y, params=None, folds=folds, model_type='lgb', plot_feature_importance=False, model=None):
    oof = np.zeros(X.shape[0])
    prediction = np.zeros(X_test.shape[0])
    scores = []
    feature_importance = pd.DataFrame()
    # Repeat for each fold of the k folds
    for fold_n, (train_index, valid_index) in enumerate(folds.split(X)):
        print('Fold', fold_n, 'started at', time.ctime())
        X_train, X_valid = X.values[train_index], X.values[valid_index]
        y_train, y_valid = y[train_index], y[valid_index]
        if model_type == 'lgb':
            model = lgb.LGBMRegressor(**params, n_estimators=20000, nthread=4, n_jobs=-1)
            model.fit(X_train, y_train,
                      eval_set=[(X_train, y_train), (X_valid, y_valid)], eval_metric='rmse',
                      verbose=1000, early_stopping_rounds=200)
            y_pred_valid = model.predict(X_valid)
            y_pred = model.predict(X_test, num_iteration=model.best_iteration_)
        if model_type == 'xgb':
            train_data = xgb.DMatrix(data=X_train, label=y_train)
            valid_data = xgb.DMatrix(data=X_valid, label=y_valid)
            watchlist = [(train_data, 'train'), (valid_data, 'valid_data')]
            model = xgb.train(dtrain=train_data, num_boost_round=20000, evals=watchlist,
                              early_stopping_rounds=200, verbose_eval=500, params=params)
            y_pred_valid = model.predict(xgb.DMatrix(X_valid), ntree_limit=model.best_ntree_limit)
            y_pred = model.predict(xgb.DMatrix(X_test.values), ntree_limit=model.best_ntree_limit)
        if model_type == 'cat':
            model = CatBoostRegressor(iterations=20000, eval_metric='RMSE', **params)
            model.fit(X_train, y_train, eval_set=(X_valid, y_valid), cat_features=[],
                      use_best_model=True, verbose=False)
            y_pred_valid = model.predict(X_valid)
            y_pred = model.predict(X_test)
        oof[valid_index] = y_pred_valid.reshape(-1,)
        scores.append(mean_squared_error(y_valid, y_pred_valid) ** 0.5)
        prediction += y_pred
        # accumulate feature importances for the LGB model
        if model_type == 'lgb':
            fold_importance = pd.DataFrame()
            fold_importance["feature"] = X.columns
            fold_importance["importance"] = model.feature_importances_
            fold_importance["fold"] = fold_n + 1
            feature_importance = pd.concat([feature_importance, fold_importance], axis=0)
    prediction /= n_fold
    print('CV mean score: {0:.4f}, std: {1:.4f}.'.format(np.mean(scores), np.std(scores)))
    if model_type == 'lgb':
        feature_importance["importance"] /= n_fold
        if plot_feature_importance:
            cols = feature_importance[["feature", "importance"]].groupby("feature").mean().sort_values(
                by="importance", ascending=False)[:50].index
            best_features = feature_importance.loc[feature_importance.feature.isin(cols)]
            plt.figure(figsize=(16, 12))
            sns.barplot(x="importance", y="feature", data=best_features.sort_values(by="importance", ascending=False))
            plt.title('LGB Features (avg over folds)')
        return oof, prediction, feature_importance
    else:
        return oof, prediction
# train LightGBM model
params = {'objective':'regression',
'num_leaves' : 30,
'min_data_in_leaf' : 20,
'max_depth' : 9,
'learning_rate': 0.004,
'min_child_samples':100,
'feature_fraction':0.9,
"bagging_freq": 1,
"bagging_fraction": 0.9,
'lambda_l1': 0.2,
"bagging_seed": random_seed,
"metric": 'rmse',
'subsample':.8,
'colsample_bytree':.9,
"random_state" : random_seed,
"verbosity": -1}
oof_lgb, prediction_lgb, _ = train_model(train, test, y, params=params, model_type='lgb',
plot_feature_importance=True)
# Train XGBoost model
xgb_params = {'eta': 0.01,
'objective': 'reg:linear',
'max_depth': 6,
'subsample': 0.6,
'colsample_bytree': 0.7,
'eval_metric': 'rmse',
'seed': 25,
'silent': True}
oof_xgb, prediction_xgb = train_model(train, test, y, params=xgb_params, model_type='xgb')
# Train CatBoost model
cat_params = {'learning_rate': 0.004,
'depth': 5,
'l2_leaf_reg': 10,
'colsample_bylevel': 0.8,
'bagging_temperature': 0.2,
'od_type': 'Iter',
'od_wait': 100,
'random_seed': random_seed,
'allow_writing_files': False}
oof_cat, prediction_cat = train_model(train, test, y, params=cat_params, model_type='cat')
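The blended error score mentioned earlier can be checked by averaging the three models' out-of-fold predictions before scoring. A self-contained sketch below uses synthetic stand-ins for `oof_lgb`, `oof_xgb`, `oof_cat`, and `y` (the real arrays come from the `train_model` calls above):

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Synthetic stand-ins for the arrays produced by train_model above
rng = np.random.default_rng(7)
y_true = rng.normal(16.0, 3.0, size=1000)        # stand-in for log1p revenue targets
oof_lgb = y_true + rng.normal(0.0, 2.0, 1000)    # stand-in out-of-fold predictions
oof_xgb = y_true + rng.normal(0.0, 2.0, 1000)
oof_cat = y_true + rng.normal(0.0, 2.0, 1000)

# Average the out-of-fold predictions, then score the blend
oof_blend = (oof_lgb + oof_xgb + oof_cat) / 3
rmse_single = np.sqrt(mean_squared_error(y_true, oof_lgb))
rmse_blend = np.sqrt(mean_squared_error(y_true, oof_blend))
print(rmse_single, rmse_blend)  # the blend scores lower when model errors are independent
```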
# prepare stack for stacked model training
train_stack = np.vstack([oof_lgb, oof_xgb, oof_cat]).transpose()
train_stack = pd.DataFrame(train_stack, columns=['lgb', 'xgb', 'cat'])
test_stack = np.vstack([prediction_lgb, prediction_xgb, prediction_cat]).transpose()
test_stack = pd.DataFrame(test_stack, columns=['lgb', 'xgb', 'cat'])
# Train stacked model
params = {'num_leaves': 8,
'min_data_in_leaf': 20,
'objective': 'regression',
'max_depth': 3,
'learning_rate': 0.01,
"boosting": "gbdt",
"bagging_seed": 11,
"metric": 'rmse',
"lambda_l1": 0.2,
"verbosity": -1}
oof_lgb_stack, prediction_lgb_stack, _ = train_model(train_stack, test_stack, y,
params=params, model_type='lgb', plot_feature_importance=True)
# Re-run the stacked model with K-fold cross-validation, now with 10 folds.
# (n_fold is also used inside train_model to average predictions, so it must match.)
random_seed = 200
n_fold = 10
folds = KFold(n_splits=n_fold, shuffle=True, random_state=random_seed)
# Train stacked model
params = {'num_leaves': 8,
'min_data_in_leaf': 20,
'objective': 'regression',
'max_depth': 3,
'learning_rate': 0.01,
"boosting": "gbdt",
"bagging_seed": 11,
"metric": 'rmse',
"lambda_l1": 0.2,
"verbosity": -1}
oof_lgb_stack, prediction_lgb_stack, _ = train_model(train_stack, test_stack, y,
params=params, model_type='lgb', plot_feature_importance=True)
Now that the three chosen models and the stacked model are trained, let us generate predictions on the test dataset, see how closely they match one another, and pick a final model.
# read submission dataset, call each of our models individually, blended, stacked...
# write outputs for each call, as well as all predictions
sub = pd.read_csv('data/sample_submission.csv')
all_pred = pd.read_csv('data/sample_submission.csv')
sub['revenue'] = np.expm1(prediction_lgb)
sub.to_csv("output/lgb.csv", index=False)
all_pred['pred_lgb'] = np.expm1(prediction_lgb)
sub['revenue'] = np.expm1(prediction_xgb)
sub.to_csv("output/xgb.csv", index=False)
all_pred['pred_xgb'] = np.expm1(prediction_xgb)
sub['revenue'] = np.expm1(prediction_cat)
sub.to_csv("output/cat.csv", index=False)
all_pred['pred_cat'] = np.expm1(prediction_cat)
sub['revenue'] = np.expm1((prediction_lgb + prediction_xgb) / 2)
sub.to_csv("output/blend_lgb_xgb.csv", index=False)
all_pred['pred_blend_lgb_xgb'] = np.expm1((prediction_lgb + prediction_xgb) / 2)
sub['revenue'] = np.expm1((prediction_lgb + prediction_xgb + prediction_cat) / 3)
sub.to_csv("output/blend_all3.csv", index=False)
all_pred['pred_blend_all3'] = np.expm1((prediction_lgb + prediction_xgb + prediction_cat) / 3)
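One detail worth noting: the blends above average predictions on the log1p scale and only then apply expm1, which corresponds to a geometric (not arithmetic) mean of the revenue estimates, damping the influence of any single model's large prediction. A small sketch with toy numbers:

```python
import numpy as np

# Two models' predictions for one movie, on the log1p(revenue) scale
log_preds = np.log1p(np.array([100.0, 400.0]))

# Averaging in log space then back-transforming...
blend = np.expm1(log_preds.mean())

# ...equals the geometric mean of the shifted revenues, minus the shift
geo = np.sqrt(101.0 * 401.0) - 1.0
print(blend, geo)  # both ~200.25, well below the arithmetic mean of 250
```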
sub['revenue'] = np.expm1(prediction_lgb_stack)  # back-transform from log1p scale, like the other submissions
sub.to_csv("output/stack_lgb.csv", index=False)
all_pred['pred_stack_lgb'] = np.expm1(prediction_lgb_stack)
all_pred.to_csv("output/all_predictions.csv", index=False)
all_pred.info()
Visualizing the heatmap, the blend of all 3 models (pred_blend_all3) shows very high correlation with the other predictions: effectively perfect correlation with two of the models and 0.99 with the other two. I am going to pick it as my final model.
# Heatmap visualization for predictions across models
pred_viz = all_pred[['pred_lgb','pred_xgb','pred_cat','pred_blend_lgb_xgb','pred_blend_all3','pred_stack_lgb']]
f,ax = plt.subplots(figsize=(20, 10))
sns.heatmap(pred_viz.corr(), annot=True)
plt.show()
# Scatter matrix visualization of predictions across models
pd.plotting.scatter_matrix(pred_viz, alpha = 0.5, figsize = (14,8), diagonal = 'kde');